[Bugfix] [Core] Fix zero temperature case (#5404 and part of #5898) #12802
Conversation
Supports both zero and nonzero temperatures in a batch; see vllm-project#5404 (comment). Signed-off-by: Stan Hatko <stan_hatko@live.com>
Signed-off-by: Stan Hatko <stan_hatko@live.com>
Very much appreciate the investigation! IMO this is quite an elegant patch. @WoosukKwon, what do you think?
For the pre-commit hook error here, it shows missing SPDX headers in the file.

For the test with increased GPU VRAM usage, some increase makes sense due to the creation of additional tensors. The previous code just did an in-place divide after the conversion to float32, while the current code needs to compute tensors for both cases and sum them. I think correctness of outputs at zero temperature (a very common use case) is more important than saving GPU VRAM at the sampling step. It may also be possible to optimize the code to do more of the work in-place, but some new tensors are inevitable with this approach.

However, it is possible to avoid the increased GPU VRAM usage with a custom CUDA kernel: basically rewrite this as a CUDA kernel and call that. No new tensors would be needed, since the computations would be done in temporary variables within a loop, with the final output values assigned directly to the output tensor. I think we should first merge this and then implement the CUDA kernel as an optimization. I have no experience writing CUDA kernels (though I did write and optimize OpenCL code many years ago).
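As a rough illustration of the memory tradeoff (my own sketch, not code from this PR), a `torch.where` formulation can drop the explicit indicator tensors, though it still materializes the quotient and an output tensor, so only a fused kernel would avoid new allocations entirely:

```python
import torch

def scale_logits_where(logits: torch.Tensor, temperature: torch.Tensor) -> torch.Tensor:
    """Hypothetical lower-allocation variant; still allocates the quotient
    and the output, unlike a fused CUDA kernel."""
    is_zero = (temperature == 0.0).unsqueeze(-1)  # [batch, 1] indicator
    # Placeholder divisor for zero-temperature rows; those rows never take
    # the division branch, so the value 1.0 is arbitrary.
    safe_temp = torch.where(temperature == 0.0,
                            torch.ones_like(temperature),
                            temperature).unsqueeze(-1)
    # Per-row selection between untouched logits (zero temp) and scaled
    # logits (nonzero temp) in a single op.
    return torch.where(is_zero, logits, logits / safe_temp)
```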
Wow this is amazing! Thanks for the thorough investigation!
Sorry for the delay in review. LGTM overall, but I'd like to understand the problem in more detail. Let me get back to this within 1~2 days.
@StanHatko thank you for the investigation. To me this looks like a workaround for an underlying problem: doing argmax on the logprobs for greedy sampling. Could you try out this PR, which fixes that? #13312

At a minimum I think this should avoid the need for the `logits_z = is_zero * logits` step.

Man, I forgot how convoluted our v0 sampler code was 😅
Yes, fixing the sampler to be deterministic in the temperature = 0 case seems like a better solution; ideally, by just taking the largest logit in that case, everything here can be avoided. I'll check whether that pull request fixes the issue.
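For reference, the direction being discussed (an illustrative line of my own, not the code in #13312) is to make greedy decoding a direct argmax over the raw logits rather than a round-trip through the logprobs:

```python
import torch

logits = torch.randn(4, 32000)  # dummy [batch, vocab] logits for illustration

# Greedy token selection straight from the logits; skipping the log-softmax
# round-trip removes one source of tie-breaking instability.
next_tokens = torch.argmax(logits, dim=-1)
```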
Unfortunately I'm having trouble reproducing some of the old non-deterministic runs. There have been a bunch of changes since then, and the prompts involved were long, complex, and contained data that can't be released publicly. With simple prompts I didn't see the non-determinism in some basic tests. If anyone has a minimal working example, please post it.

Some performance optimizations I enabled for throughput (in particular enabling prefix caching, sorting inputs before feeding them in to take advantage of prefix caching, enabling chunked prefill, and CUDA graph optimization) together seemed to reduce the amount of non-determinism: the number of unique responses for 5 different queries, each repeated 100 times, went down from 3, 4, 1, 2, 3 to 1, 2, 1, 1, 2. A configuration sketch follows below.

The new PR looks good. I agree the sampler itself is the better place to make the changes; the reason I didn't do that myself is that I wasn't familiar with the sampler code but understood the logits and logprobs code.

For greedy sampling, is it guaranteed the flag will be set for zero-temperature entries? If a batch has a mixture of zero and nonzero temperatures, will it use greedy sampling for the zero-temperature entries and regular sampling for the rest?
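A hedged sketch of the kind of configuration described above (flag names follow vLLM's engine arguments as I understand them; model name, prompt, and repeat count are placeholders, and availability of these flags may vary by version):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",     # placeholder model
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
    enable_chunked_prefill=True,   # chunk long prompts during prefill
    enforce_eager=False,           # permit CUDA graph capture
)

params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy decoding

# Repeating the same prompt and counting unique outputs is how the
# unique-response numbers above were measured.
outputs = llm.generate(["Example prompt"] * 100, params)
print(len({o.outputs[0].text for o in outputs}), "unique responses")
```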
Thanks @StanHatko
@StanHatko yes that's correct, and greedy requests are determined by …
This pull request has merge conflicts that must be resolved before it can be merged.
Fixes nondeterminism from the sampler that occurs even at zero temperature. The patch supports both zero and nonzero temperatures for different entries in the same batch, using the correct operation in each case. It does this by multiplying by an is-zero indicator for the zero-temperature calculation and an is-nonzero indicator for the positive-temperature case, then summing the two terms. For any entry, only one term in the sum is nonzero, since the two cases are mutually exclusive.
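A minimal sketch of that indicator-sum computation (my own reconstruction with assumed names; the actual patch lives in the sampler and may differ in details):

```python
import torch

def apply_temperature(logits: torch.Tensor, temperature: torch.Tensor) -> torch.Tensor:
    """Supports zero and nonzero temperatures mixed in one batch."""
    is_zero = (temperature == 0.0).unsqueeze(-1)   # [batch, 1] indicator
    is_nonzero = ~is_zero
    # Placeholder divisor so zero-temperature rows never divide by zero;
    # those rows are masked out of the division branch below.
    safe_temp = torch.where(temperature == 0.0,
                            torch.ones_like(temperature),
                            temperature).unsqueeze(-1)
    logits_z = is_zero * logits                     # zero temp: logits untouched
    logits_nz = is_nonzero * (logits / safe_temp)   # nonzero temp: usual scaling
    # Exactly one term is nonzero per row, so the sum selects the right case.
    return logits_z + logits_nz
```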
I created this patch and discussed it in #5404 (comment). I used it internally and it greatly reduced variation in answers at zero temperature (the remaining nondeterminism, I think, is due to CUDA nondeterminism in the steps leading up to the calculation of the logits).
Please feel free to make any improvements to this patch you can think of.
FIX #5404 (patch discussed there).
FIX #5898 (the sampler part; nondeterminism from the computation of the logits themselves is not covered here).